Data Visualization Final Project

Julianna Szabo

3/29/2021

In this project, we will analyse how earnings differ between men and women in the US. The data was collected in 2019, so results may have changed since then.

Questions

Some of the questions I will be trying to answer in this report are:

  • Is there really a gender gap in earnings and in the number of women in certain positions?
  • Is the gender gap the same across industries or positions?
  • Are some age groups more affected?
  • Is there a difference between full-time and part-time?

Loading the data

We are loading three datasets that all include important information that will be used in this analysis. The three datasets complement each other, but they cannot be combined since they each show a different aspect of the same situation. The only thing they have in common is the timeline, and that is not a good key for merging.
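The loading code itself is not shown in this report, so the following is only a minimal sketch of how three separate tables could be read in and kept apart. The helper name load_gender_data and the three file names are assumptions made for illustration. The sketch uses base R; since the later sections use data.table syntax, each table would still need to be converted (for example with data.table::as.data.table()).

```r
# Hypothetical helper: reads the three CSV files from a local folder.
# The file names below are assumptions; adjust them to the actual files.
load_gender_data <- function(data_dir) {
  read_one <- function(file_name) {
    read.csv(file.path(data_dir, file_name), stringsAsFactors = FALSE)
  }
  # The tables are returned side by side in a list and deliberately not merged,
  # because they describe different aspects and share no useful key.
  list(
    obs_gender      = read_one("jobs_gender.csv"),
    earnings_female = read_one("earnings_female.csv"),
    employed_gender = read_one("employed_gender.csv")
  )
}
```

Calling load_gender_data("path/to/data") would return a list with one data frame per dataset.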

Data understanding

For the data understanding, we will be focusing on the first table comparing wages across positions for both genders. This will be the main table used for analysis and the others will be used or merged as needed.

pander(summary(obs_gender))
Table continues below
year          occupation        major_category    minor_category
Min.   :2013  Length:2088       Length:2088       Length:2088
1st Qu.:2014  Class :character  Class :character  Class :character
Median :2014  Mode  :character  Mode  :character  Mode  :character
Mean   :2014  NA                NA                NA
3rd Qu.:2015  NA                NA                NA
Max.   :2016  NA                NA                NA

Table continues below
total_workers     workers_male      workers_female    percent_female
Min.   :    658   Min.   :      0   Min.   :      0   Min.   :  0.00
1st Qu.:  18687   1st Qu.:  10765   1st Qu.:   2364   1st Qu.: 10.73
Median :  58997   Median :  32302   Median :  15238   Median : 32.40
Mean   : 196055   Mean   : 111515   Mean   :  84540   Mean   : 36.00
3rd Qu.: 187415   3rd Qu.: 102644   3rd Qu.:  63326   3rd Qu.: 57.31
Max.   :3758629   Max.   :2570385   Max.   :2290818   Max.   :100.00

Table continues below
total_earnings    total_earnings_male   total_earnings_female
Min.   : 17266    Min.   : 12147        Min.   :  7447
1st Qu.: 32410    1st Qu.: 35702        1st Qu.: 28872
Median : 44437    Median : 46825        Median : 40191
Mean   : 49762    Mean   : 53138        Mean   : 44681
3rd Qu.: 61012    3rd Qu.: 65015        3rd Qu.: 54813
Max.   :201542    Max.   :231420        Max.   :166388
NA                NA's   :4             NA's   :65

wage_percent_of_male
Min.   : 50.88
1st Qu.: 77.56
Median : 85.16
Mean   : 84.03
3rd Qu.: 90.62
Max.   :117.40
NA's   :846

plot(obs_gender)

The pairwise plots suggest that the variables are either roughly linearly related or roughly normally distributed, so the data looks usable for analysis. The only fields that look correlated are those that were calculated from each other or that measure the same thing, such as the number of workers or earnings.

plot(x = obs_gender$total_earnings_male, y = obs_gender$total_earnings_female)

This plot does make it look like women earn less than men at any given level of male earnings. However, since it plots all occupations together, it is quite general, and we might be comparing apples to oranges. Therefore, we should look at it by position.
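To make the comparison concrete per occupation, the ratio of female to male earnings can be computed row by row. The sketch below uses made-up toy values, not figures from the dataset, and it assumes that wage_percent_of_male is defined as female earnings divided by male earnings times 100; the column names wage_ratio and women_earn_less are introduced only for this illustration.

```r
# Toy values, made up for illustration; not taken from the real dataset.
toy <- data.frame(
  occupation            = c("A", "B", "C"),
  total_earnings_male   = c(50000, 80000, 40000),
  total_earnings_female = c(40000, 72000, 42000)
)

# Female earnings as a percentage of male earnings, per occupation
# (assumed to match how wage_percent_of_male is defined).
toy$wage_ratio <- 100 * toy$total_earnings_female / toy$total_earnings_male

# Flag the occupations where women earn less than men.
toy$women_earn_less <- toy$wage_ratio < 100

print(toy)
```

A ratio below 100 means women earn less than men in that occupation, and a ratio above 100 means the opposite.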

The data also has a time-series aspect through the year variable, so that should be checked.

plot(x = obs_gender$year, y = obs_gender$total_workers)
plot(x = obs_gender$year, y = obs_gender$total_earnings)

Looking at both total earnings and the total number of workers, there does not seem to have been much change between 2013 and 2016. Therefore, the year can be disregarded when making comparisons.
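The visual impression can be backed up with yearly averages. This is a sketch on made-up toy values that only reuses the column names year, total_workers and total_earnings; the column earnings_change_pct is introduced here for illustration.

```r
# Toy values, made up for illustration; not taken from the real dataset.
toy <- data.frame(
  year           = c(2013, 2013, 2014, 2014, 2015, 2015),
  total_workers  = c(100, 300, 110, 310, 120, 290),
  total_earnings = c(30000, 50000, 31000, 51000, 30500, 52000)
)

# Mean number of workers and mean earnings per year.
yearly <- aggregate(cbind(total_workers, total_earnings) ~ year,
                    data = toy, FUN = mean)

# Percent change in mean earnings relative to the first year.
yearly$earnings_change_pct <-
  100 * (yearly$total_earnings / yearly$total_earnings[1] - 1)

print(yearly)
```

If the changes stay within a few percent, as in this toy example, treating the years as comparable seems reasonable.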

Number of women and their income in different industries

female <- obs_gender[, list(total_workers = sum(total_workers), 
                            percent_female = mean(percent_female), 
                            total_earnings = mean(total_earnings), 
                            wage_percent_of_male = mean(wage_percent_of_male, na.rm = TRUE)),
                     by = 'minor_category']

theme_custom <- theme(
  text = element_text(family = "Palatino", size = 12),
  legend.position = 'bottom',
  plot.background = element_rect(color = "black", size = 1)
)
# Bar chart of pre-aggregated values: geom_col() plots the y values as given
ggplot(female, aes(x = minor_category, y = total_workers)) +
  geom_col() +
  theme_bw() + theme_custom + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Number of total workers per industry', y = 'Total number of workers')

# Same chart type for the average share of female workers
ggplot(female, aes(x = minor_category, y = percent_female)) +
  geom_col() +
  theme_bw() + theme_custom + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Percentage of female workers per industry', y = 'Percentage of female workers')

There are clearly industries with far fewer women, such as Construction or Installations, where women make up less than 10% of workers. These are usually considered male professions, so this gap is expected. Hopefully, women who want to enter these professions will find their way into them in the future. At the other end, industries such as Healthcare and Education, which are usually considered female professions, have a share of women well over 70%.
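Note that the aggregation uses mean(percent_female), which gives every occupation in an industry the same weight regardless of its size. As a possible cross-check, and not the method used in this report, the share can also be computed from the worker counts directly. The sketch below uses made-up toy values; weighted_percent_female is a name introduced for illustration.

```r
# Toy values, made up for illustration; not taken from the real dataset.
toy <- data.frame(
  minor_category = c("X", "X", "Y", "Y"),
  total_workers  = c(1000, 100, 500, 500),
  workers_female = c(100, 90, 300, 200)
)
toy$percent_female <- 100 * toy$workers_female / toy$total_workers

# Unweighted: average of the occupation-level percentages.
unweighted <- tapply(toy$percent_female, toy$minor_category, mean)

# Weighted: total female workers divided by total workers per industry.
sums <- aggregate(cbind(workers_female, total_workers) ~ minor_category,
                  data = toy, FUN = sum)
sums$weighted_percent_female <- 100 * sums$workers_female / sums$total_workers

print(unweighted)
print(sums)
```

In the toy data, industry X has an unweighted share of 50% but a weighted share of about 17%, because its largest occupation has very few women. When occupations are of similar size, as in industry Y, the two measures agree.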

# Average yearly earnings per industry
ggplot(female, aes(x = minor_category, y = total_earnings)) +
  geom_col() +
  theme_bw() + theme_custom + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Yearly earnings per industry', y = 'Salary ($)')

# Female wages as a percentage of male wages per industry
ggplot(female, aes(x = minor_category, y = wage_percent_of_male)) +
  geom_col() +
  theme_bw() + theme_custom + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Percentage of female wage per industry', y = 'Percentage of female wage')

Interestingly, while pay varies considerably across industries, women's earnings as a share of men's stay relatively stable at around 70%. This suggests a systematic issue rather than something that only comes up in some industries or workplaces.
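One way to put a number on "varies a lot" versus "stays stable" is the coefficient of variation (standard deviation divided by mean), which makes the spread of two differently scaled columns comparable. This is a sketch on made-up toy values; the helper coef_var is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: coefficient of variation, ignoring missing values.
coef_var <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Toy values, made up for illustration; not taken from the real dataset.
toy <- data.frame(
  minor_category       = c("A", "B", "C", "D"),
  total_earnings       = c(30000, 45000, 60000, 90000),
  wage_percent_of_male = c(78, 82, 80, NA)
)

cv_earnings <- coef_var(toy$total_earnings)
cv_ratio    <- coef_var(toy$wage_percent_of_male)

print(c(earnings = cv_earnings, wage_ratio = cv_ratio))
```

A much smaller value for the wage ratio than for earnings would support the reading that the gap is similar across industries even though pay levels are not.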

Change in wage percentage over the past 30 years

str(earnings_female)
## Classes 'data.table' and 'data.frame':   264 obs. of  3 variables:
##  $ Year   : int  1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 ...
##  $ group  : chr  "Total, 16 years and older" "Total, 16 years and older" "Total, 16 years and older" "Total, 16 years and older" ...
##  $ percent: num  62.3 64.2 64.4 65.7 66.5 67.6 68.1 69.5 69.8 70.2 ...
##  - attr(*, ".internal.selfref")=<externalptr>
unique(earnings_female$group)
## [1] "Total, 16 years and older" "16-19 years"              
## [3] "20-24 years"               "25-34 years"              
## [5] "35-44 years"               "45-54 years"              
## [7] "55-64 years"               "65 years and older"

There is an additional Total group, which I will exclude since it aggregates all the other categories and therefore adds no new information.

wage_gap <- earnings_female[group != "Total, 16 years and older", ]

ggplot(wage_gap, aes(x = Year, y = percent, color = group)) +
  geom_point() +
  geom_smooth(method = 'lm', se = TRUE) +
  theme_bw() + theme_custom +
  transition_states(group) +
  labs(title = '{closest_state}', y = 'Female wages as a percentage of male wages (%)')
## `geom_smooth()` using formula 'y ~ x'

It is interesting to see that the younger the age group, the smaller the gap. One possible explanation is that the gap was larger when older workers started their careers, and they have only been able to close it so much since then. It could also be due to the differences in the type of work seen above.
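The per-group trend lines drawn by geom_smooth(method = 'lm') can also be summarised numerically by fitting one linear model per age group. The sketch below uses made-up toy values with the same column names as earnings_female; the helper fit_group and its output names are assumptions for illustration.

```r
# Toy values, made up for illustration; not taken from the real dataset.
toy <- data.frame(
  Year    = rep(c(1980, 1990, 2000), times = 2),
  group   = rep(c("20-24 years", "55-64 years"), each = 3),
  percent = c(80, 85, 90, 60, 62, 64)
)

# Hypothetical helper: fit percent ~ Year for one group and return the
# yearly slope plus the fitted value for the latest year in that group.
fit_group <- function(d) {
  fit <- lm(percent ~ Year, data = d)
  latest <- predict(fit, newdata = data.frame(Year = max(d$Year)))
  c(slope_per_year = unname(coef(fit)["Year"]),
    latest_fitted  = unname(latest))
}

# One row per age group.
trends <- t(sapply(split(toy, toy$group), fit_group))
print(trends)
```

Comparing latest_fitted across groups shows the size of the gap by age, while slope_per_year shows how quickly each group has been closing it.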

The makeup of the work force

ggplot(data = employed_gender) +
  geom_line(aes(x = year, y = full_time_female), color = 'maroon2') +
  geom_line(aes(x = year, y = full_time_male), color = 'blue') +
  theme_bw() + theme_custom +
  transition_reveal(year) +
  labs(title = 'Change in percentage of Full-time work', y = "Percentage of full-time workers")

ggplot(data = employed_gender) +
  geom_line(aes(x = year, y = part_time_female), color = 'maroon2') +
  geom_line(aes(x = year, y = part_time_male), color = 'blue') +
  theme_bw() + theme_custom +
  transition_reveal(year) +
  labs(title = 'Change in percentage of part-time work', y = "Percentage of part-time workers")

The blue line represents men and the pink one women.

It is really interesting how gender relates to the type of job that is most common. While the share has decreased over the past 30 years, men still hold full-time jobs in over 80% of cases. Women, on the other hand, hold full-time jobs in only around 70-75% of cases, and about 25% have part-time jobs. This is much higher than the roughly 10-15% of men who work part-time.
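The difference between the two lines can be expressed directly as a gap in percentage points. This is a sketch on made-up toy values that reuses the column names full_time_female and full_time_male; the column full_time_gap_pp is introduced here for illustration.

```r
# Toy values, made up for illustration; not taken from the real dataset.
toy <- data.frame(
  year             = c(1970, 1990, 2010),
  full_time_female = c(72, 74, 73),
  full_time_male   = c(90, 88, 84)
)

# Gap in percentage points between the male and female full-time shares.
toy$full_time_gap_pp <- toy$full_time_male - toy$full_time_female

print(toy)
```

Because both columns are already percentages, subtracting them gives percentage points rather than a relative percent difference.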

Conclusion

Overall, it seems that the gender gap is real in many ways. Women are underrepresented in some positions and are paid less across all types of positions, and have been for the past 30 years. This does not improve as they get older; it actually becomes worse. Further, they are also less likely to hold full-time jobs, by about 10 percentage points. It is very important to have this data laid out and shown. While this analysis doesn’t prove any type of causation, it does show a correlation between gender and amount earned.